Accelerating Queries on Very Large Datasets
Abstract
In this chapter, we explore ways to answer queries on large multi-dimensional data efficiently. Given a large dataset, a user often wants to access only a relatively small number of the records. Such a selection is typically performed through an SQL query in a database management system (DBMS). In general, the most effective technique for accelerating query answering is indexing. For this reason, our primary emphasis is on reviewing indexing techniques for large datasets. Since much scientific data is not managed by a DBMS, our review also includes many indexing techniques developed outside of DBMSs. Among the known indexing methods, bitmap indexes are particularly well suited for answering such queries on large scientific data. Therefore, we give more detail on the state of the art in bitmap indexing techniques. This chapter also briefly touches on some emerging data analysis systems that do not yet make use of indexes. We present some evidence that these systems could also benefit from the use of indexes.
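To make the idea of a bitmap index concrete, the following is a minimal sketch, assuming an uncompressed, equality-encoded index in which each distinct column value gets one bitmap. It is not taken from the chapter and is not the API of any particular system; the helper names `build_bitmap_index` and `rows_of` and the toy columns are hypothetical. Production systems typically compress the bitmaps, which this sketch omits.

```python
from collections import defaultdict


def build_bitmap_index(values):
    """Map each distinct value to an int whose bit i is set when values[i] == value."""
    index = defaultdict(int)
    for row, value in enumerate(values):
        index[value] |= 1 << row
    return dict(index)


def rows_of(bitmap):
    """Return the row numbers whose bits are set in the bitmap, in ascending order."""
    rows = []
    row = 0
    while bitmap:
        if bitmap & 1:
            rows.append(row)
        bitmap >>= 1
        row += 1
    return rows


# Hypothetical toy table with two columns (illustration only).
temperature = ["low", "high", "high", "mid", "low", "high"]
region = ["north", "south", "north", "north", "south", "north"]

temp_index = build_bitmap_index(temperature)
region_index = build_bitmap_index(region)

# Multi-dimensional selection: temperature = 'high' AND region = 'north'.
# The condition on each column becomes one bitmap; AND combines them.
hits = temp_index["high"] & region_index["north"]
print(rows_of(hits))  # [2, 5]
```

The point of the sketch is that a condition over several columns reduces to bitwise operations on per-column bitmaps, so the matching rows are found without scanning the records themselves.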
Related resources
Real-time Log Query Interface for large datasets using Apache Spark
University of California, Los Angeles, {sandha,xinxu129,yuexin,lizhehan}@cs.ucla.edu. Abstract: Log Query Interface is an interactive web application that allows users to query the very large data logs of MobileInsight easily and efficiently. With this interface, users no longer need to talk to the database through command line queries, nor to install the MobileInsight client locally to fetch dat...
FastBit: An Efficient Indexing Technology For Accelerating Data-Intensive Science
FastBit is a software tool for searching large read-only datasets. It organizes user data in a column-oriented structure, which is efficient for on-line analytical processing (OLAP), and utilizes compressed bitmap indices to further speed up query processing. Analyses have proven the compressed bitmap index used in FastBit to be theoretically optimal for one-dimensional queries. Compared with oth...
Caching Stars in the Sky: A Semantic Caching Approach to Accelerate Skyline Queries
Multi-criteria decision making has been made possible with the advent of skyline queries. However, processing such queries for high-dimensional datasets remains a time-consuming task. Real-time applications are thus infeasible, especially for non-indexed skyline techniques where the datasets arrive online. In this paper, we propose a caching mechanism that uses the semantics of previous skyline...
An Optimized Data Structure for High Throughput 3D Proteomics Data: mzRTree
As an emerging field, MS-based proteomics still requires software tools for efficiently storing and accessing experimental data. In this work, we focus on the management of LC-MS data, which are typically made available in standard XML-based portable formats. The structures that are currently employed to manage these data can be highly inefficient, especially when dealing with high-throughput p...
Declarative and Efficient Querying on Protein Secondary Structures
In spite of many decades of progress in database research, scientists in the life sciences community surprisingly still struggle with inefficient and awkward tools for querying biological datasets. This work highlights a specific problem involving searching large volumes of protein datasets based on their secondary structure. In this chapter we define an intuitive query language that can be...
Publication date: 2008